Mining Frequently Changing Substructures from Historical Unordered XML Documents

Authors

  • Q Zhao
  • S S. Bhowmick
  • M Mohania
  • Y Kambayashi
Abstract

Recently, there has been increasing research effort in XML data mining. This work has largely assumed that XML documents are static. However, in many real applications, XML data are evolutionary in nature. In this paper, we focus on mining evolution patterns from historical XML documents. Specifically, we propose a novel approach to discover frequently changing structures (FCS) from a sequence of historical versions of unordered XML documents. The objective is to extract substructures that change frequently and significantly by analyzing the structural evolution patterns of XML documents. We propose two algorithms, based on a set of evolution metrics, to extract FCS from the historical XML data. We also present a battery of optimization techniques to improve the space efficiency of our algorithms. Note that such structures cannot be extracted accurately and efficiently by repeatedly applying existing frequent substructure mining techniques to a sequence of snapshot data. FCS can be useful in several applications, such as monitoring interesting structures in a specific domain, building FCS-based classifiers, indexing XML documents, and evolution-conscious XML query caching. Extensive experiments with both synthetic and real data show that the proposed algorithms are efficient and scalable and can discover FCS accurately.
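As a rough illustration of the idea of a frequently changing structure, the sketch below counts, for each element label path, the fraction of consecutive version pairs in which the subtree found at that path differs. It is not the paper's algorithm: the abstract does not define the evolution metrics or the two algorithms, so the change measure, the threshold, and the function names (`canonical`, `change_frequency`, `frequently_changing`) are all assumptions made for this example. Sibling order is ignored by sorting child signatures, which is one simple way to treat documents as unordered.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def canonical(elem):
    """Order-insensitive signature of a subtree: tag, text, sorted child signatures."""
    children = sorted(canonical(child) for child in elem)
    return (elem.tag, (elem.text or "").strip(), tuple(children))


def subtree_signatures(root):
    """Map each label path (e.g. 'catalog/book') to the sorted signatures found there."""
    found = {}

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}" if prefix else elem.tag
        found.setdefault(path, []).append(canonical(elem))
        for child in elem:
            walk(child, path)

    walk(root, "")
    return {path: sorted(sigs) for path, sigs in found.items()}


def change_frequency(versions):
    """Hypothetical metric: share of consecutive version pairs in which a path changed."""
    snapshots = [subtree_signatures(ET.fromstring(v)) for v in versions]
    changes = Counter()
    for old, new in zip(snapshots, snapshots[1:]):
        # A path counts as changed if it appears, disappears, or its subtrees differ.
        for path in old.keys() | new.keys():
            if old.get(path) != new.get(path):
                changes[path] += 1
    pairs = len(snapshots) - 1
    return {path: count / pairs for path, count in changes.items()}


def frequently_changing(versions, threshold=0.5):
    """Paths whose change frequency meets the (assumed) threshold, sorted by name."""
    freq = change_frequency(versions)
    return sorted(path for path, f in freq.items() if f >= threshold)
```

Because a change in a descendant also alters the signature of every ancestor, this sketch flags whole ancestor chains. It also compares every pair of snapshots in full, which is the kind of repeated snapshot processing the abstract says is inefficient.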

Similar resources

FASST Mining: Discovering Frequently Changing Semantic Structure from Versions of Unordered XML Documents

In this paper, we present a FASST mining approach to extract the frequently changing semantic structures (FASSTs), which are a subset of semantic substructures that change frequently, from versions of unordered XML documents. We propose a data structure, H-DOM, and a FASST mining algorithm, which incorporates semantic information and takes advantage of the related domain knowledge. The distin...

FRACTURE mining: Mining frequently and concurrently mutating structures from historical XML documents

In the past few years, the fast proliferation of available XML documents has stimulated a great deal of interest in discovering hidden and nontrivial knowledge from XML repositories. However, to the best of our knowledge, none of the existing work on XML mining has taken into account the dynamic nature of XML documents as online information. The present article proposes a novel type of frequent pat...

Discovering Pattern-Based Dynamic Structures from Versions of Unordered XML Documents

Existing work on XML data mining deals with snapshot XML data only, while XML data is dynamic in real applications. In this paper, we discover knowledge from XML data by taking into account its dynamic nature. We present a novel approach to extract pattern-based dynamic structures from versions of unordered XML documents. With the proposed dynamic metrics, the pattern-based dynamic structures are exp...

Mining Maximal Frequently Changing Subtree Patterns from XML Documents

Due to the dynamic nature of online information, XML documents typically evolve over time. The change of the data values or structures of an XML document may exhibit some particular patterns. In this paper, we focus on the sequence of changes to the structures of an XML document to find out which subtrees in the XML structure frequently change together, which we call Frequently Changing Subtree...

Canonical Forms for Labeled Trees and Their Applications in Frequent Subtree Mining

Tree structures are used extensively in domains such as computational biology, pattern recognition, XML databases, computer networks, and so on. In this paper, we first present two canonical forms for labeled rooted unordered trees–the breadth-first canonical form (BFCF) and the depth-first canonical form (DFCF). Then the canonical forms are applied to the frequent subtree mining problem. Based...

Publication date: 2006